Abstract

Point clouds are sets of points in two or three dimensions. Most kernel methods for learning on 
sets of points have not yet dealt with the specific geometrical invariances and practical constraints 
associated with point clouds in computer vision and graphics. In this paper, we present extensions
of graph kernels for point clouds, which allow kernel methods to be used for such objects as shapes,
line drawings, or any three-dimensional point clouds. In order to design rich and numerically
efficient kernels with as few free parameters as possible, we use kernels between covariance matrices 
and their factorizations on graphical models. We derive polynomial time dynamic programming 
recursions and present applications to recognition of handwritten digits and Chinese characters 
from few training examples. 

1. Introduction 

In recent years, kernels for structured data have been designed in many domains, such as
bioinformatics [1], speech processing [2], text processing [3] and computer vision [4]. They provide an
elegant way of including known a priori information, by using directly the natural topological
structure of objects. Using a priori knowledge through structured kernels has proved beneficial because
it reduces the number of training examples needed and allows existing data representations that
are already well developed by experts of those domains to be re-used.

In this paper, we propose a kernel between point clouds, with applications to classification of
line drawings (such as handwritten digits [5] or Chinese characters [6, 7]) or shapes [8]. The natural
geometrical structure of point clouds is hard to represent in a few real-valued features [9], in particular
because of (a) the required local or global invariances by rotation, scaling, and/or translation, (b) the
lack of pre-established registrations of the point clouds (i.e., points from one cloud are not matched
to points from another cloud), and (c) the noise and occlusion that impose that only portions of two
point clouds ought to be compared.

Following one of the leading principles for designing kernels between structured data, we propose
to look at all possible partial matches between two point clouds [10]. More precisely, we assume that
each point cloud has a graph structure (most often a neighborhood graph), and we consider recently
introduced graph kernels [11, 12]. Intuitively, these kernels consider all possible subgraphs and
compare and count the matching subgraphs. However, the set of subgraphs (or even the set of paths)
has exponential size and cannot be efficiently described recursively; so larger sets of substructures
are commonly used, e.g., walks and tree-walks. As shown in Section 2, by choosing appropriate
substructures and fully factorized local kernels, efficient dynamic programming implementations
make it possible to sum over an exponential number of substructures in polynomial time. The kernel thus
provides an efficient and elegant way of considering very large feature spaces (see, e.g., [10]).

However, in the context of computer vision, substructures correspond to matched sets of points,
and dealing with local invariances requires a local kernel that cannot be readily expressed as
a product of separate terms for each pair of points, so the usual dynamic programming approaches
cannot be applied. The main contribution of this paper is to design a local kernel that is not
fully factorized but can instead be factorized according to the graph underlying the substructure.
This is naturally done through graphical models and the design of positive kernels for covariance
matrices that factorize on graphical models (Section 3). With this novel local kernel, we derive new
polynomial time dynamic programming recursions in Section 4. In Section 5, we present simulations
on handwritten character recognition.

Figure 1: (left) path, (center) 1-walk which is not a 2-walk, (right) 2-walk which is not a 3-walk.



2. Graph kernels 

In this section, we consider two labelled undirected graphs $G = (V,E,a,x)$ and $H = (W,F,b,y)$.
Two types of labels are considered: attributes, which are denoted $a(v) \in \mathcal{A}$ for vertex $v \in V$ and
$b(w) \in \mathcal{A}$ for vertex $w \in W$, and positions, which are denoted $x(v) \in \mathcal{X}$ and $y(w) \in \mathcal{X}$. Our
motivating examples are line drawings, where $\mathcal{X} = \mathcal{A} = \mathbb{R}^2$.

The definition of graph kernels between $G$ and $H$ relies on a set of substructures of the graphs.
The most natural ones are paths, subtrees and more generally subgraphs; however, they do not lead
to efficient enumerations, and recent work [11, 13] has focused on larger sets of substructures that
we now present.



2.1 Paths, walks, subtrees and tree-walks 

Given the undirected graph $G$ with vertex set $V$, a path is a sequence of distinct connected vertices,
while a walk is a sequence of possibly non-distinct connected vertices. For any positive integer $\beta$,
we define $\beta$-walks as walks such that any $\beta+1$ successive vertices are distinct (1-walks are regular
walks). Note that when the graph $G$ is a tree (no cycles), the set of 2-walks is equal to the
set of paths (see examples in Figure 1). More generally, for any graph, $\beta$-walks of length $\beta+1$ are
exactly paths of length $\beta+1$.
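The definition of $\beta$-walks can be illustrated with a short enumeration. This is a minimal sketch under assumptions that are not in the text: the graph is given as an adjacency dictionary without self-loops, "length" counts vertices, and the function name `beta_walks` and the toy graph are hypothetical.

```python
def beta_walks(adj, length, beta):
    """Enumerate walks with `length` vertices in which any beta+1 successive
    vertices are distinct (beta=1 gives ordinary walks on a loop-free graph)."""
    found = []

    def extend(walk):
        if len(walk) == length:
            found.append(tuple(walk))
            return
        recent = walk[-beta:]  # the new vertex must differ from the last beta vertices
        for nxt in adj[walk[-1]]:
            if nxt not in recent:
                extend(walk + [nxt])

    for start in adj:
        extend([start])
    return found


# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks_1 = beta_walks(adj, 3, 1)  # ordinary walks, backtracking allowed
walks_2 = beta_walks(adj, 3, 2)  # 2-walks with 3 vertices, i.e., paths
```

With three vertices and $\beta = 2$, every enumerated walk has three distinct vertices, which matches the remark that $\beta$-walks of length $\beta+1$ are exactly paths.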

A subtree is a subgraph of $G$ with no cycles. We can represent a subtree of $G$ by a tree structure
$T$ over the vertex set $\{1,\dots,|T|\}$, where $|T|$ is the number of nodes in $T$, and a sequence of distinct
consistent labels $I \in V^{|T|}$ (i.e., labels that are neighbors in $G$ when the corresponding nodes are
neighbors in $T$). In this paper, we consider only rooted subtrees, i.e., subtrees where a specific node
is identified as the root.¹

The notion of walk extends the notion of path by allowing nodes to be equal. Similarly, we
can extend the notion of subtrees to tree-walks, which can have nodes that are equal. More precisely,
we define an $\alpha$-ary tree-walk of depth $\gamma$ of $G$ as a (non-complete) labelled $\alpha$-ary tree of depth $\gamma$ with
nodes labelled by vertices in $G$, and such that the labels of neighbors in the tree-walk are neighbors
in $G$. A tree-walk may be represented by a tree structure $T$ over the vertex set $\{1,\dots,|T|\}$ and a
sequence of consistent but possibly non-distinct labels $I \in V^{|T|}$. We can also define $\beta$-tree-walks,
as tree-walks such that for each node in $T$, its label (i.e., an element of $V$) and the ones of all its
descendants up to the $\beta$-th generation are all distinct. With that definition, 1-tree-walks are regular
tree-walks (see Figure 2). Note that if $\alpha = 1$, we get back $\beta$-walks, and the graph kernels that we
use are then often referred to as random walk kernels [11]. From now on, we refer to the descendants up
to the $\beta$-th generation as the $\beta$-descendants.

We let $\mathcal{T}_{\alpha,\gamma}$ denote the set of tree structures of depth less than $\gamma$ and with at most $\alpha$ children
per node. For $T \in \mathcal{T}_{\alpha,\gamma}$, we define $\mathcal{J}_\beta(T,G)$ as the set of consistent labellings of $T$ by vertices in
$V$ leading to $\beta$-tree-walks. With these definitions, a $\beta$-tree-walk of $G$ is characterized by a tree
structure $T \in \mathcal{T}_{\alpha,\gamma}$ and a labelling $I \in \mathcal{J}_\beta(T,G)$.

1. Moreover, all the trees that we consider are unordered trees (i.e., no order is considered among siblings). 

Figure 2: (left) 2-tree-walk, (center) 1-tree-walk which is not a 2-tree-walk.



2.2 Graph kernels 

Assuming a local kernel $q_{T,I,J}(G,H)$ between two tree-walks that share the same structure is given,
following [11], we can define the tree-kernel as the sum over all matching tree-walks of $G$ and $H$ of
the local kernel:

$$k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H) = \sum_{T \in \mathcal{T}_{\alpha,\gamma}} f_{\lambda,\nu}(T) \sum_{I \in \mathcal{J}_\beta(T,G)} \sum_{J \in \mathcal{J}_\beta(T,H)} q_{T,I,J}(G,H). \qquad (1)$$

If the kernel $q_{T,I,J}(G,H)$ has positive values and is equal to 1 if the two tree-walks are equal, it can
be seen as a soft matching indicator, and then the kernel in Eq. (1) simply counts the softly matched
tree-walks in the two graphs.

We add a nonnegative penalization $f_{\lambda,\nu}(T)$ depending only on the tree structure. Besides the
usual penalization of the number of nodes $|T|$, we also add a penalization of the number of leaf nodes
$\ell(T)$. More precisely, we use the penalization $f_{\lambda,\nu}(T) = \lambda^{|T|}\nu^{\ell(T)}$. This penalization, suggested by [13],
is essential in our situation to prevent trees with nodes of higher degrees from dominating the sum.
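As a small illustration of the penalization $f_{\lambda,\nu}(T) = \lambda^{|T|}\nu^{\ell(T)}$, the sketch below evaluates it on two tree structures with the same number of nodes. The representation of a rooted tree as a dictionary of children lists and the name `tree_penalty` are assumptions made for this example only.

```python
def tree_penalty(children, lam, nu):
    """f_{lambda,nu}(T) = lam**|T| * nu**l(T) for a rooted tree given as a
    dict mapping each node to the list of its children."""
    n_nodes = len(children)
    n_leaves = sum(1 for kids in children.values() if not kids)
    return lam ** n_nodes * nu ** n_leaves


chain = {0: [1], 1: [2], 2: []}    # 3 nodes, 1 leaf
star = {0: [1, 2], 1: [], 2: []}   # 3 nodes, 2 leaves

penalty_chain = tree_penalty(chain, lam=1.0, nu=0.1)
penalty_star = tree_penalty(star, lam=1.0, nu=0.1)
```

With $\nu < 1$, the branching structure receives a smaller weight than the chain, which is the effect the leaf penalization is meant to have.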

If $q_{T,I,J}(G,H)$ is a positive kernel between $G$ and $H$, then $k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H)$ also defines a positive
kernel. The kernel $k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H)$ sums the local kernel $q_{T,I,J}(G,H)$ over all tree-walks of $G$ and $H$
that share the same tree structure. The number of matching tree-walks is exponential in the depth
$\gamma$; thus, in order to deal with potentially deep trees, a recursive definition is needed. It requires a
specific type of local kernels.
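To make the role of the recursion concrete, the following sketch treats only the simplest special case: ordinary walks ($\alpha = 1$, $\beta = 1$) of a fixed number of vertices, a fully factorized local kernel, and no penalization. It is not the recursion of this paper; it only checks, on hypothetical toy graphs, that a dynamic program over pairs of vertices gives the same value as the explicit sum over all pairs of walks.

```python
import math


def walks(adj, length):
    """All walks with `length` vertices in a graph given as an adjacency dict."""
    result = [[v] for v in adj]
    for _ in range(length - 1):
        result = [w + [u] for w in result for u in adj[w[-1]]]
    return result


def walk_kernel_brute(adj_g, a, adj_h, b, length, k_local):
    """Explicit sum over all pairs of walks of the product of local kernels."""
    total = 0.0
    for I in walks(adj_g, length):
        for J in walks(adj_h, length):
            total += math.prod(k_local(a[i], b[j]) for i, j in zip(I, J))
    return total


def walk_kernel_dp(adj_g, a, adj_h, b, length, k_local):
    """Same sum, computed by a recursion over pairs of starting vertices."""
    table = {(v, w): k_local(a[v], b[w]) for v in adj_g for w in adj_h}
    for _ in range(length - 1):
        table = {
            (v, w): k_local(a[v], b[w])
            * sum(table[(v2, w2)] for v2 in adj_g[v] for w2 in adj_h[w])
            for v in adj_g for w in adj_h
        }
    return sum(table.values())


# Hypothetical toy graphs with scalar attributes.
adj_g = {0: [1, 2], 1: [0, 2], 2: [0, 1]}   # triangle
adj_h = {0: [1], 1: [0, 2], 2: [1]}         # chain
a = {0: 0.0, 1: 0.5, 2: 1.0}
b = {0: 0.1, 1: 0.4, 2: 0.9}
k_local = lambda s, t: math.exp(-(s - t) ** 2)

brute = walk_kernel_brute(adj_g, a, adj_h, b, 4, k_local)
fast = walk_kernel_dp(adj_g, a, adj_h, b, 4, k_local)
```

The brute-force sum grows exponentially with the walk length, whereas the table has one entry per pair of vertices and is updated once per step.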



2.3 Local kernel 

We use a combination (product) of a kernel for attributes and a kernel for positions. For attributes,
we use the following usual factorized form $k_{\mathcal{A}}(a(I), b(J)) = \prod_{p=1}^{|I|} k_{\mathcal{A}}(a(I_p), b(J_p))$, where $k_{\mathcal{A}}$ is a
positive definite kernel on $\mathcal{A} \times \mathcal{A}$. This allows the separate comparison of each matched pair of
points and efficient dynamic programming recursions [11]. However, for our local kernel on
positions, we need a kernel that jointly depends on the whole vectors $x(I)$ and $y(J)$, and not only
on pairs $(x(I_p), y(J_p))$.

In this paper, we focus on $\mathcal{X} = \mathbb{R}^d$ and translation invariant local kernels, which implies that the
local kernel for positions may only depend on differences $x(i) - x(i')$ and $y(j) - y(j')$ for $(i,i') \in I \times I$
and $(j,j') \in J \times J$. We further reduce these to the kernel matrices corresponding to a translation
invariant kernel $k_{\mathcal{X}}(x - x')$. Depending on the application, $k_{\mathcal{X}}$ may or may not be rotation invariant.

That is, we define full kernel matrices $K \in \mathbb{R}^{|V| \times |V|}$ and $L \in \mathbb{R}^{|W| \times |W|}$ for each graph, defined as
$K(v,v') = k_{\mathcal{X}}(x(v) - x(v'))$ (and similarly for $L$). For simplicity, we assume that these matrices are
positive definite (i.e., invertible), which can be enforced by adding a multiple of the identity matrix.
The local kernel will thus only depend on the submatrices $K_I = K_{I,I}$ and $L_J = L_{J,J}$, which are
positive definite matrices. Note that we use kernel matrices $K$ and $L$ to represent the geometry of
each graph, and that we use a kernel on such kernel matrices.

Figure 3: (left) rooted tree structure $T$, (middle) decomposable graphical model $Q_1(T)$, defined in
Section 3.4, (right) corresponding rooted junction tree with cliques (ellipses) and separators
(rectangles). By convention, all trees are rooted from top to bottom.

We use the following kernel on positive matrices $K$ and $L$, the (squared) Bhattacharyya kernel
$k_B$, defined as

$$k_B(K,L) = |K|^{1/2}\,|L|^{1/2}\,\left|\frac{K+L}{2}\right|^{-1}, \qquad (2)$$

which is a positive kernel with pointwise positive values and such that $k_B(K,K) = 1$ ($|K|$ denotes
the determinant of $K$).
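The kernel of Eq. (2) is straightforward to evaluate numerically. The sketch below is one possible implementation using NumPy log-determinants for numerical stability; the function name and the random test matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np


def bhattacharyya_kernel(K, L):
    """Squared Bhattacharyya kernel |K|^{1/2} |L|^{1/2} / |(K + L) / 2|
    between two symmetric positive definite matrices of the same size."""
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_L = np.linalg.slogdet(L)
    _, logdet_M = np.linalg.slogdet(0.5 * (K + L))
    return float(np.exp(0.5 * logdet_K + 0.5 * logdet_L - logdet_M))


# Hypothetical positive definite matrices for illustration.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 4, 4))
K = A @ A.T + np.eye(4)
L = B @ B.T + np.eye(4)

k_KL = bhattacharyya_kernel(K, L)
k_KK = bhattacharyya_kernel(K, K)
```

Since the log-determinant is concave on positive definite matrices, the value always lies in $(0, 1]$, with equality to 1 when the two matrices coincide.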

By taking the product of the attribute-based local kernel and the position-based local kernel,
we get the following local kernel $q^0_{I,J}(G,H) = k_B(K_I, L_J)\, k_{\mathcal{A}}(a(I), b(J))$. However, this local kernel
$q^0_{I,J}(G,H)$ does not yet depend on the tree structure $T$, and the recursion may be efficient only
if $q_{T,I,J}$ can be computed recursively. The factorized term $k_{\mathcal{A}}(a(I), b(J))$ does not cause any problems;
however, for the term $k_B(K_I, L_J)$, we need an approximation based on $T$. As we show in Section 3,
this can be obtained by a factorization according to the appropriate graphical model.

3. Positive matrices and graphical models 

The main idea underlying the factorization of the kernel is to consider symmetric positive matrices
as covariance matrices and look at graphical models defined for Gaussian random vectors with
those covariance matrices. In this section we assume that we have $n$ random variables $Z_1,\dots,Z_n$
with probability distribution $p(z) = p(z_1,\dots,z_n)$. Given a kernel matrix $K$ (in our case defined
as $K_{ij} = e^{-\alpha\|x_i - x_j\|^2}$, for positions $x_1,\dots,x_n$), we consider random variables $Z_1,\dots,Z_n$ such that
$\mathrm{cov}(Z_i,Z_j) = K_{ij}$. In this section, with this identification, we consider covariance matrices as kernel
matrices, and vice versa.

3.1 Graphical models and junction trees 

Graphical models provide a flexible and intuitive way of defining factorized probability distributions.
Given any undirected graph $Q$ with vertices in $\{1,\dots,n\}$, the distribution $p(z)$ is said to factorize in
$Q$ if it can be written as a product of potentials over all cliques (completely connected subgraphs) of
the graph $Q$. When the distribution is Gaussian with covariance matrix $K \in \mathbb{R}^{n \times n}$, the distribution
factorizes if and only if $(K^{-1})_{ij} = 0$ for each pair $(i,j)$ which is not an edge in $Q$ [14, 15].

In this paper, we only consider decomposable graphical models, for which the graph $Q$ is
triangulated (i.e., there exists no chordless cycle of length four or more). In this case, the joint
distribution is uniquely defined from its marginals $p(z_C)$ on the cliques $C$ of the graph $Q$. Namely,
if $\mathcal{C}(Q)$ is the set of maximal cliques of $Q$, we can build a tree of cliques, a junction tree, such that

$$p(z) = \frac{\prod_{C \in \mathcal{C}(Q)} p(z_C)}{\prod_{C, C' \in \mathcal{C}(Q),\; C \sim C'} p(z_{C \cap C'})}$$

(see Figure 3 for an example of graphical model and junction tree). The sets $C \cap C'$ are usually
referred to as separators and we let $\mathcal{S}(Q)$ denote the set of such separators.
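The characterization of Gaussian factorization through zeros of the inverse covariance matrix can be checked numerically on a standard example. The sketch below uses the covariance $K_{ij} = \rho^{|i-j|}$ of a stationary first-order Markov chain, which is known to factorize in the chain graph; the values of `n` and `rho` are arbitrary choices for illustration.

```python
import numpy as np

n, rho = 5, 0.6
idx = np.arange(n)
K = rho ** np.abs(idx[:, None] - idx[None, :])  # covariance of a Markov chain
P = np.linalg.inv(K)                            # inverse covariance (precision)

# Edges of the chain graph 0-1-2-3-4 and the remaining (non-edge) pairs.
chain_edges = {(i, i + 1) for i in range(n - 1)}
non_edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if (i, j) not in chain_edges]

max_off_graph = max(abs(P[i, j]) for i, j in non_edges)
min_on_graph = min(abs(P[i, j]) for i, j in chain_edges)
```

Up to rounding errors, the precision matrix is tridiagonal: its entries vanish exactly on the pairs that are not edges of the chain.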

3.2 Graphical models and projections

We let $\Pi_Q(K)$ denote the covariance matrix that factorizes in $Q$ which is closest to $K$ for the
Kullback-Leibler divergence between normal distributions with covariance matrices $K$ and $L$. In
this paper, we essentially replace $K$ by $\Pi_Q(K)$; i.e., we project all our covariance matrices onto a
graphical model, which is a classical tool in probabilistic modelling [16]. We leave the study of the
approximation properties of such a projection (in particular along the lines of [17]) to future work.

Practically, since our kernel on kernel matrices involves determinants, we simply need to compute
$|\Pi_Q(K)|$ efficiently. For decomposable graphical models, $\Pi_Q(K)$ can be obtained in closed form [15]
and its determinant has the following simple expression:

$$\log|\Pi_Q(K)| = \sum_{C \in \mathcal{C}(Q)} \log|K_C| - \sum_{S \in \mathcal{S}(Q)} \log|K_S|. \qquad (3)$$
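Eq. (3) can be verified numerically on a small decomposable model. The sketch below assumes the standard closed form for the projection (the inverse of $\Pi_Q(K)$ is the sum of the zero-padded inverses of the clique blocks minus the zero-padded inverses of the separator blocks); this formula is not spelled out in the text and is used here only as a check. The chain-shaped model and the random matrix are hypothetical.

```python
import numpy as np


def logdet(M):
    return np.linalg.slogdet(M)[1]


def projected_logdet(K, cliques, separators):
    """Eq. (3): clique log-determinants minus separator log-determinants."""
    total = sum(logdet(K[np.ix_(C, C)]) for C in cliques)
    total -= sum(logdet(K[np.ix_(S, S)]) for S in separators)
    return total


def projection(K, cliques, separators):
    """Projection onto a decomposable model, built from its precision matrix
    (assumed closed form: padded clique inverses minus padded separator inverses)."""
    P = np.zeros_like(K)
    for C in cliques:
        P[np.ix_(C, C)] += np.linalg.inv(K[np.ix_(C, C)])
    for S in separators:
        P[np.ix_(S, S)] -= np.linalg.inv(K[np.ix_(S, S)])
    return np.linalg.inv(P)


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
K = A @ A.T + np.eye(4)
cliques = [[0, 1], [1, 2], [2, 3]]   # chain 0-1-2-3
separators = [[1], [2]]

Pi = projection(K, cliques, separators)
lhs = logdet(Pi)
rhs = projected_logdet(K, cliques, separators)
```

The projected matrix keeps the clique blocks of $K$ unchanged and has zeros in its inverse outside the graph, so only small determinants need to be computed in practice.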

If the junction tree is rooted (by choosing any clique as the root), then for each clique but the root,
a unique parent clique $p_Q(C)$ is defined, and we have:

$$\log|\Pi_Q(K)| = \sum_{C \in \mathcal{C}(Q)} \log\frac{|K_C|}{|K_{C \cap p_Q(C)}|} = \sum_{C \in \mathcal{C}(Q)} \log|K_{C|p_Q(C)}|, \qquad (4)$$

where $p_Q(C)$ is the parent clique of $C$ (and $\varnothing$ for the root clique) and the conditional covariance is
defined, as usual, as $K_{C|p_Q(C)} = K_{C,C} - K_{C,p_Q(C)} K_{p_Q(C),p_Q(C)}^{-1} K_{p_Q(C),C}$.

3.3 Graphical models and kernels 

We now propose several ways of defining a kernel adapted to graphical models. All of them are
based on replacing determinants $|M|$ by $|\Pi_Q(M)|$, and their different decompositions in Eq. (3) and
Eq. (4). Using Eq. (3), we obtain the first kernel:

$$k^{Q}_{B,0}(K,L) = \frac{\prod_{C \in \mathcal{C}(Q)} k_B(K_C, L_C)}{\prod_{S \in \mathcal{S}(Q)} k_B(K_S, L_S)}. \qquad (5)$$

However, this is not always a positive kernel for general covariance matrices:

Proposition 1 For any decomposable model $Q$, the kernel $k^{Q}_{B,0}$ defined in Eq. (5) is a positive kernel
on the set of covariance matrices $K$ such that for all separators $S \in \mathcal{S}(Q)$, $K_{S,S} = I$. In particular,
when all separators have cardinal one, this is a kernel on correlation matrices.

In order to remove the condition on separators, we consider the rooted junction tree
representation in Eq. (4) and define another kernel as

$$k^{Q}_{B}(K,L) = \prod_{C \in \mathcal{C}(Q)} k_B^{C|p_Q(C)}(K,L). \qquad (6)$$

For the root $R$, we define $k_B^{R|\varnothing}(K,L) = k_B(K_R, L_R)$, and the kernels $k_B^{C|p_Q(C)}(K,L)$ are defined as
kernels between conditional Gaussian distributions of $Z_C$ given $Z_{p_Q(C)}$. We use

$$k_B^{C|p_Q(C)}(K,L) = \frac{|K_{C|p_Q(C)}|^{1/2}\,|L_{C|p_Q(C)}|^{1/2}}{\left|\tfrac{1}{2}K_{C|p_Q(C)} + \tfrac{1}{2}L_{C|p_Q(C)} + \tfrac{1}{4}\left(K_{C,p_Q(C)}K_{p_Q(C)}^{-1} - L_{C,p_Q(C)}L_{p_Q(C)}^{-1}\right)^{\otimes 2}\right|}, \qquad (7)$$

which corresponds to putting a prior with identity covariance matrix on variables $Z_{p_Q(C)}$ and
considering the kernel between the resulting joint covariance matrices on variables indexed by $(C, p_Q(C))$
(we use the notation $M^{\otimes 2} = MM^\top$). We now have a positive kernel on all covariance matrices:

Proposition 2 For any decomposable model $Q$, the kernel $k^{Q}_{B}(K,L)$ defined in Eq. (6) and Eq. (7)
is a positive kernel on the set of covariance matrices.
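The interpretation of Eq. (7) as a Bhattacharyya kernel between joint covariance matrices with an identity prior on the parent variables can be checked numerically. The sketch below assumes that the index sets `C` (variables being conditioned) and `P` (parent variables) are disjoint, and uses the same index sets for both matrices for simplicity; all function names and the random matrices are hypothetical.

```python
import numpy as np


def logdet(M):
    return np.linalg.slogdet(M)[1]


def bhattacharyya_kernel(K, L):
    return float(np.exp(0.5 * logdet(K) + 0.5 * logdet(L) - logdet(0.5 * (K + L))))


def conditional_parts(M, C, P):
    """Regression matrix M_{C,P} M_{P,P}^{-1} and conditional covariance M_{C|P}."""
    A = M[np.ix_(C, P)] @ np.linalg.inv(M[np.ix_(P, P)])
    return A, M[np.ix_(C, C)] - A @ M[np.ix_(P, C)]


def conditional_kernel(K, L, C, P):
    """Eq. (7) for disjoint index lists C and P (an assumption of this sketch)."""
    A, Kc = conditional_parts(K, C, P)
    B, Lc = conditional_parts(L, C, P)
    D = A - B
    denom = 0.5 * Kc + 0.5 * Lc + 0.25 * (D @ D.T)
    return float(np.exp(0.5 * logdet(Kc) + 0.5 * logdet(Lc) - logdet(denom)))


def joint_with_identity_prior(M, C, P):
    """Joint covariance of (Z_C, Z_P) when Z_P has identity covariance and the
    conditional distribution of Z_C given Z_P is the one defined by M."""
    A, Mc = conditional_parts(M, C, P)
    return np.block([[Mc + A @ A.T, A], [A.T, np.eye(len(P))]])


rng = np.random.default_rng(1)
A1, A2 = rng.standard_normal((2, 4, 4))
K = A1 @ A1.T + np.eye(4)
L = A2 @ A2.T + np.eye(4)
C, P = [0, 1], [2, 3]

direct = conditional_kernel(K, L, C, P)
via_joint = bhattacharyya_kernel(joint_with_identity_prior(K, C, P),
                                 joint_with_identity_prior(L, C, P))
```

The two values agree because the Schur complement of the identity block in the averaged joint matrix reduces to the denominator of Eq. (7).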

Figure 4: (left) undirected graph $G$, (right) graph $G_{1,2}$.



Note that the kernel is not invariant by the choice of the particular root of the junction tree.
However, in our setting, this is not an issue because we have a natural way of rooting the junction
trees (i.e., following the rooted tree-walk).

In Section 4, we will use the notation $k_B^{I_1|I_2,\,J_1|J_2}(K,L)$ for $|I_1| = |J_1|$ and $|I_2| = |J_2|$ to denote
the kernel between covariance matrices $K_{I_1 \cup I_2}$ and $L_{J_1 \cup J_2}$ adapted to the conditional distributions
$I_1|I_2$ and $J_1|J_2$ (defined through Eq. (7)).

3.4 Choice of graphical models 

Given the rooted tree structure $T$ of a $\beta$-tree-walk, we now need to define the graphical model $Q_\beta(T)$
that we use to project our kernel matrices. We define $Q_\beta(T)$ such that for all nodes in $T$, the node
together with all its $\beta$-descendants form a clique, i.e., a node is connected to its $\beta$-descendants and
all $\beta$-descendants are also mutually connected (see Figure 3 for an example with $\beta = 1$): the set of
cliques is thus the set of families of depth $\beta$ (i.e., with $\beta + 1$ generations). Thus our final kernel is:

$$k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H) = \sum_{T \in \mathcal{T}_{\alpha,\gamma}} f_{\lambda,\nu}(T) \sum_{I \in \mathcal{J}_\beta(T,G)} \sum_{J \in \mathcal{J}_\beta(T,H)} k_B^{Q_\beta(T)}(K_I, L_J)\, k_{\mathcal{A}}(a(I), b(J)). \qquad (8)$$

The main intuition behind this definition is to sum local similarities over all matching subgraphs.
In order to obtain a tractable formulation, we simply needed to (a) extend the set of subgraphs (to
tree-walks of depth $\gamma$) and (b) factorize the local similarities along the graphs. Note that the graph
$Q_\beta(T)$ that we chose is the densest graph for which the following dynamic programming recursions
may hold.
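The construction of $Q_\beta(T)$ can be sketched by listing, for each node of the rooted tree, the family made of the node and its $\beta$-descendants. The dictionary-of-children representation and the function name are assumptions of this example; note that families nested inside a larger family (e.g., those of leaves) are listed too, although they are not maximal cliques.

```python
def model_families(children, beta):
    """For each node of a rooted tree (dict node -> list of children), return
    the node together with its descendants up to the beta-th generation."""
    def descendants(node, depth):
        if depth == 0:
            return []
        out = []
        for child in children[node]:
            out.append(child)
            out.extend(descendants(child, depth - 1))
        return out

    return {node: [node] + descendants(node, beta) for node in children}


# Hypothetical rooted tree: 0 has children 1 and 2, and 1 has child 3.
tree = {0: [1, 2], 1: [3], 2: [], 3: []}
families_1 = model_families(tree, 1)
families_2 = model_families(tree, 2)
```

For $\beta = 1$ each family is a node with its children, so the resulting graph links every parent to its children and siblings to each other; larger $\beta$ gives denser models.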

4. Dynamic programming recursions 

In order to derive dynamic programming recursions, we follow [13] and rely on the fact that $\alpha$-ary
$\beta$-tree-walks of $G$ can essentially be defined through 1-tree-walks on the augmented graph of all
subtrees of $G$ of depth at most $\beta - 1$ and arity less than $\alpha$.

We thus consider the set $V_{\alpha,\beta}$ of non-complete rooted (unordered) subtrees of $G = (V,E)$, of
depths less than $\beta - 1$ and arity less than $\alpha$. Two different rooted unordered labelled trees
are said to be equivalent if they share the same tree structure, and this is denoted $\sim_t$.

On this set $V_{\alpha,\beta}$, we define a directed graph with edge set $E_{\alpha,\beta}$ as follows: $R_0 \in V_{\alpha,\beta}$ is connected
to $R_1 \in V_{\alpha,\beta}$ if "the tree $R_1$ extends the tree $R_0$ one generation further", i.e., if and only if (a) the
first $\beta - 2$ generations of $R_1$ are exactly equal to one of the complete subtrees of $R_0$ rooted at a child
of the root of $R_0$, and (b) the nodes of depth $\beta - 1$ of $R_1$ are distinct from the nodes in $R_0$. This
defines a graph $G_{\alpha,\beta} = (V_{\alpha,\beta}, E_{\alpha,\beta})$ (see Figure 4). Similarly we define a graph $H_{\alpha,\beta} = (W_{\alpha,\beta}, F_{\alpha,\beta})$
for the graph $H$.

For a $\beta$-tree-walk, the root with its $(\beta - 1)$-descendants must have distinct vertices and thus
correspond exactly to elements of $V_{\alpha,\beta}$. We let $k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H,R_0,S_0)$ denote the same kernel as
defined in Eq. (8), but restricted to tree-walks that start respectively with $R_0$ and $S_0$. Note
that if $R_0$ and $S_0$ are not equivalent, then $k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H,R_0,S_0) = 0$.

Figure 5: For digits and Chinese characters: (left) original characters, (right) thinned and
subsampled characters.



We obtain the following recursion for all $R_0 \in V_{\alpha,\beta}$ and $S_0 \in W_{\alpha,\beta}$ such that $R_0 \sim_t S_0$:

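$$k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H,R_0,S_0) = \sum_{p=0}^{\alpha} \;\sum_{\substack{R_1,\dots,R_p \in \mathcal{N}_{G_{\alpha,\beta}}(R_0) \\ R_1,\dots,R_p \text{ disjoint}}} \;\sum_{\substack{S_1,\dots,S_p \in \mathcal{N}_{H_{\alpha,\beta}}(S_0) \\ S_1,\dots,S_p \text{ disjoint}}} \big[\,\cdots\,\big]$$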

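Note that if any of the trees $R_i$ is not equivalent to $S_i$, it does not contribute to the sum. The
recursion is initialized with $k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H,R_0,S_0) = \lambda^{|R_0|}\nu^{\ell(R_0)}\, k_{\mathcal{A}}(a(R_0), b(S_0))\, k_B(K_{R_0}, L_{S_0})$ and the
final kernel is obtained as $k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H) = \sum_{R_0 \sim_t S_0} k^{\mathcal{T}}_{\alpha,\beta,\gamma}(G,H,R_0,S_0)$.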

Note that we may reduce the computational load by considering a set of trees of smaller arity in
the previous recursions; i.e., we consider $V_{1,\beta}$ instead of $V_{\alpha,\beta}$ with tree-kernels of arity $\alpha > 1$.

4.1 Computational complexity 

The complexity of computing one kernel between two graphs is linear in $\gamma$ (the depth of the
tree-walks), and quadratic in the size of $V_{\alpha,\beta}$ and $W_{\alpha,\beta}$. However, those sets have exponential size in
$\beta$ and $\alpha$ in general. Thus, we are limited to small values (typically $\alpha \leq 3$ and $\beta \leq 6$), which
are sufficient for good classification performance (see Section 5). For example, for the handwritten
digits we use in simulations, the average number of nodes in the graphs is $18 \pm 4$, while the average
cardinal of $V_{\alpha,\beta}$ is $37 \pm 13$ ($\alpha = 1$, $\beta = 4$), and $70 \pm 44$ ($\alpha = 2$, $\beta = 4$).

5. Application to handwritten character recognition 

We have tested our new kernels on the task of isolated handwritten character recognition, on
handwritten Arabic numerals (MNIST dataset) and Chinese characters (ETL9B dataset). We selected
the first 100 examples for the ten classes in the MNIST dataset, while for the ETL9B dataset, we
selected the five hardest classes to discriminate among the 3,000 (by computing distances between
class means) and then selected the first 50 examples per class. Our learning task is to classify those
characters; we use a one-vs-one multiclass scheme with 2-norm support vector machines [10].

We consider characters as drawings in $\mathbb{R}^2$, which are sets of possibly intersecting contours. Those
are naturally represented as undirected planar graphs. We have thinned and subsampled uniformly
each character to reduce the sizes of the graphs (see two examples in Figure 5).

The kernel on positions is $k_{\mathcal{X}}(x,y) = \exp(-\tau\|x - y\|^2) + \kappa\,\delta(x,y)$, but could take into account
different weights on horizontal and vertical directions. We add the positions from the center of
the bounding box as features, to take into account the global positions, i.e., we use $k_{\mathcal{A}}(x,y) =
\exp(-\upsilon\|x - y\|^2)$. This is necessary because the problem of handwritten character recognition is not
globally translation invariant.
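The kernel matrix on positions used above can be assembled as follows. This is a minimal sketch assuming that the points of a graph are distinct, so that the term $\kappa\,\delta(x,y)$ amounts to adding $\kappa$ on the diagonal; the function name and the three example points are hypothetical.

```python
import numpy as np


def position_kernel_matrix(X, tau, kappa):
    """K[v, v'] = exp(-tau * ||x_v - x_v'||^2) + kappa * [v == v'] for the
    rows x_v of X (assumed to be distinct points)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-tau * sq_dists) + kappa * np.eye(len(X))


X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])   # hypothetical 2-D points
K = position_kernel_matrix(X, tau=0.05, kappa=0.001)

# Translating all points leaves the matrix unchanged.
K_shifted = position_kernel_matrix(X + np.array([3.0, -1.0]), tau=0.05, kappa=0.001)
```

The ridge term keeps the matrix safely positive definite, and the translation invariance of this matrix is the reason why global position information has to be supplied separately through the attributes.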

                   $\beta = 1$    $\beta = 2$    $\beta = 4$    $\beta = 6$    RBF
MNIST - $\alpha = 1$   8.6 ± 1.3      7.0 ± 2.1      5.4 ± 1.0      5.6 ± 2.2      9.3 ± 2.1
MNIST - $\alpha = 2$   7.3 ± 2.0      6.8 ± 1.0      6.6 ± 1.2      5.7 ± 1.0
ETL9B - $\alpha = 1$   38.8 ± 7.0     34.8 ± 7.0     30.0 ± 8.0     31.2 ± 7.8     48.4 ± 5.9
ETL9B - $\alpha = 2$   45.2 ± 9.9     29.2 ± 10.7    31.2 ± 6.7     25.6 ± 4.3

Figure 6: Error rates (multiplied by 100) on handwritten character classification tasks



In this paper we have defined a family of kernels, corresponding to different values of the following 
free parameters (shown with their possible values): 



$\alpha$      arity of tree-walks                      1, 2
$\beta$       order of tree-walks                      1, 2, 4, 6
$\gamma$      depth of tree-walks                      1, 2, 4, 8, 16, 24
$\lambda$     penalization on number of nodes          1
$\nu$         penalization on number of leaf nodes     .1, .01
$\tau$        bandwidth for kernel on positions        .05, .01, .1
$\kappa$      ridge parameter                          .001
$\upsilon$    bandwidth for kernel on attributes       .05, .01, .1


The first two sets of parameters ($\alpha$, $\beta$, $\gamma$, $\lambda$, $\nu$) are parameters of the graph kernel, independent of
the application, while the last set ($\tau$, $\kappa$, $\upsilon$) are parameters of the kernels for attributes and positions.
Note that with only a few important scale parameters ($\tau$ and $\upsilon$), we are able to characterize complex
interactions between the vertices and edges of the graphs. In practice, this is very important, to
avoid considering many distinct parameters for all sizes and topologies of subtrees.

In simulations, we performed two loops of 5-fold cross-validation: in the outer loop, we consider
5 different training folds with their corresponding testing folds. On each training fold, we consider
all possible values of $\alpha$ and $\beta$. For all of those values, we select all other parameters (including the
regularization parameters of the SVM) by 5-fold cross-validation (the inner folds). Once the best
parameters are found only by looking at the training fold, we train on the whole training fold, and
test on the testing fold. We output the means and standard deviations of the testing errors for each
testing fold. We compare the performance with the Gaussian-RBF kernel with bandwidth learned
by cross-validation in the same way in Figure 6.
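The two nested cross-validation loops described above can be sketched generically as follows. The callback `error_fn(train, test, params)`, which would train a classifier on `train` and return its error rate on `test`, is a hypothetical placeholder, as are the contiguous fold splits and the toy error function; the actual experiments use SVMs with the kernels of this paper.

```python
def k_folds(indices, k):
    """Split a list of indices into k nearly equal contiguous folds."""
    n = len(indices)
    bounds = [i * n // k for i in range(k + 1)]
    return [indices[bounds[i]:bounds[i + 1]] for i in range(k)]


def nested_cv(n, param_grid, error_fn, k=5):
    """Outer loop: k train/test splits. Inner loop: parameter selection by
    k-fold cross-validation restricted to the outer training fold."""
    outer_errors = []
    for outer_test in k_folds(list(range(n)), k):
        held_out = set(outer_test)
        outer_train = [i for i in range(n) if i not in held_out]

        def inner_error(params):
            errs = []
            for inner_test in k_folds(outer_train, k):
                inner_held_out = set(inner_test)
                inner_train = [i for i in outer_train if i not in inner_held_out]
                errs.append(error_fn(inner_train, inner_test, params))
            return sum(errs) / len(errs)

        best = min(param_grid, key=inner_error)  # chosen on the training fold only
        outer_errors.append(error_fn(outer_train, outer_test, best))
    mean = sum(outer_errors) / k
    std = (sum((e - mean) ** 2 for e in outer_errors) / k) ** 0.5
    return mean, std


# Toy check with a made-up error function whose minimum is at params == 0.1.
calls = []

def toy_error(train, test, params):
    calls.append((set(train), set(test)))
    return abs(params - 0.1)

mean_err, std_err = nested_cv(50, [0.01, 0.05, 0.1], toy_error)
```

The important property is that the outer testing fold is never seen during parameter selection, so the reported means and standard deviations are not biased by the choice of parameters.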

These results show that our new family of kernels, which use the natural structure of line drawings,
outperforms the "blind" Gaussian-RBF kernel (error rate of 5.4% instead of 9.3% for the
MNIST digits and 25.6% instead of 48.4% for the ETL9B characters). Note that for Arabic numerals,
best performance is achieved for $\alpha = 1$ (walks instead of tree-walks), which is not surprising since
most digits have a linear structure (graphs are chains). On the contrary, for Chinese characters,
best performance is achieved for binary tree-walks.



6. Conclusion 

We have presented a new kernel for point clouds which is based on comparisons of local subsets of the 
point clouds. Those comparisons are made tractable by (a) considering subsets based on tree-walks 
and walks, and (b) using a specific factorized form for the local kernels between tree-walks, namely 
a factorization on a graphical model. 

Moreover, we have reported applications to handwritten character recognition where we showed
that the kernels were able to capture the relevant information to allow good predictions from few
training examples. We are currently investigating other domains of applications of point clouds,
such as shape mining in computer vision [8, 18], and prediction of protein functions and interactions
from their three-dimensional structures [19].